These days, most major professional sports in the United States, such as football (NFL), basketball (NBA), baseball (MLB), and hockey (NHL), gather as much data as they can to improve performance and beat their opponents. Among these sports, baseball has the richest data, and teams began analyzing it through sabermetrics. Compared to other sports, baseball is known as an information sport: in other sports the players carry the ball, making it hard to track everything that happens in a short span, whereas in baseball each pitch is a discrete event with no other distraction in the situation. Baseball is also more of a team sport than the others, so one or two superstars rarely decide the outcome on their own [1]. As a result, it is easier in baseball than in other sports to predict which team will win or lose, and harder for an underdog to beat a favorite in a real game.
Team stats fall into three categories: hitting, pitching, and fielding. Hitting and pitching stats are the standard numbers we can easily find while watching games: batting stats capture how much a team hits and scores, while pitching stats capture how much a team lets its opponents hit and score. Fielding stats describe the defensive innings, such as how many double plays or errors a team made. Advanced stats used in MLB include BABIP (Batting Average on Balls in Play), DER (Defensive Efficiency Rating), and ISO (a team's raw power) [2]. Years ago, MLB relied only on the simple batting average for hitting and the simple ERA (Earned Run Average) for pitching. Nowadays, as mentioned above, MLB has several more advanced stats that are more useful for evaluating players' competence as well as a team's overall hitting, pitching, and fielding rank. Using those advanced stats should make it easier and more accurate to predict which teams will make the playoffs from their regular-season records.
From that idea, this project aims to predict which teams make the playoffs using the given data and to compare the predictions with the teams that actually did. To do so, I need to determine which hitting and pitching variables carry the most importance in the model and, where needed, derive more advanced variables that are used in real baseball analysis. As MLB is still a growing business, such a model can help fans, general managers, and sponsor companies follow the games. Thus, the goal of this project was to predict playoff qualification through classification, using Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, XGBoost, and Support Vector Machine, given input features relating to batting, pitching, and fielding (defense).
The "Teams" dataset contains a total of 48 columns, and a team makes the playoffs by winning either its division or a wild-card spot. The first few columns identify the team, its division, whether it made the playoffs, and its ballpark name. Four more columns record how many games each team played and its wins and losses. The remaining columns can be divided into three categories, batting, pitching, and fielding, holding each team's season totals for those areas. To identify playoff teams, either DivWin or WCWin must be 'Y': a team that wins its division goes to the playoffs automatically, and a team that wins the wild-card spot also earns a playoff ticket. Therefore, I combine those two columns into a new column that is true when either condition is satisfied.
%pip install seaborn
import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("default")
sns.set(font_scale=1)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.simplefilter(action = 'ignore',category = FutureWarning)
#Import the dataset
df = pd.read_csv("Teams.csv")
df.tail(5)
| yearID | lgID | teamID | franchID | divID | Rank | G | Ghome | W | L | ... | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2980 | 2021 | NL | SLN | STL | C | 2 | 162 | 81.0 | 90 | 72 | ... | 137 | 0.986 | St. Louis Cardinals | Busch Stadium III | 2102530.0 | 92 | 92 | STL | SLN | SLN |
| 2981 | 2021 | AL | TBA | TBD | E | 1 | 162 | 81.0 | 100 | 62 | ... | 130 | 0.986 | Tampa Bay Rays | Tropicana Field | 761072.0 | 92 | 91 | TBR | TBA | TBA |
| 2982 | 2021 | AL | TEX | TEX | W | 5 | 162 | 81.0 | 60 | 102 | ... | 146 | 0.986 | Texas Rangers | Globe Life Field | 2110258.0 | 99 | 101 | TEX | TEX | TEX |
| 2983 | 2021 | AL | TOR | TOR | E | 4 | 162 | 80.0 | 91 | 71 | ... | 122 | 0.984 | Toronto Blue Jays | Sahlen Field | 805901.0 | 102 | 101 | TOR | TOR | TOR |
| 2984 | 2021 | NL | WAS | WSN | E | 5 | 162 | 81.0 | 65 | 97 | ... | 116 | 0.983 | Washington Nationals | Nationals Park | 1465543.0 | 95 | 96 | WSN | MON | WAS |
5 rows × 48 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2985 entries, 0 to 2984 Data columns (total 48 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 yearID 2985 non-null int64 1 lgID 2935 non-null object 2 teamID 2985 non-null object 3 franchID 2985 non-null object 4 divID 1468 non-null object 5 Rank 2985 non-null int64 6 G 2985 non-null int64 7 Ghome 2586 non-null float64 8 W 2985 non-null int64 9 L 2985 non-null int64 10 DivWin 1440 non-null object 11 WCWin 804 non-null object 12 LgWin 2957 non-null object 13 WSWin 2628 non-null object 14 R 2985 non-null int64 15 AB 2985 non-null int64 16 H 2985 non-null int64 17 2B 2985 non-null int64 18 3B 2985 non-null int64 19 HR 2985 non-null int64 20 BB 2984 non-null float64 21 SO 2969 non-null float64 22 SB 2859 non-null float64 23 CS 2153 non-null float64 24 HBP 1827 non-null float64 25 SF 1444 non-null float64 26 RA 2985 non-null int64 27 ER 2985 non-null int64 28 ERA 2985 non-null float64 29 CG 2985 non-null int64 30 SHO 2985 non-null int64 31 SV 2985 non-null int64 32 IPouts 2985 non-null int64 33 HA 2985 non-null int64 34 HRA 2985 non-null int64 35 BBA 2985 non-null int64 36 SOA 2985 non-null int64 37 E 2985 non-null int64 38 DP 2985 non-null int64 39 FP 2985 non-null float64 40 name 2985 non-null object 41 park 2951 non-null object 42 attendance 2706 non-null float64 43 BPF 2985 non-null int64 44 PPF 2985 non-null int64 45 teamIDBR 2985 non-null object 46 teamIDlahman45 2985 non-null object 47 teamIDretro 2985 non-null object dtypes: float64(10), int64(25), object(13) memory usage: 1.1+ MB
Before beginning the data preprocessing, I imported the necessary packages and loaded the dataset. Because the dataset goes back to 1871, a few teams have relocated or disappeared from the league. Also, since the wild card was established much later, there are missing values in WCWin. As this is a sports dataset, most columns are numeric, apart from the team names and DivWin, WCWin, LgWin, and WSWin. To gauge how big the dataset is: it has 143,280 cells in total, of which 10,064 (7.02%) are missing. And to see which teams have existed since 1871, I checked the team names.
#Check for missing data
print("The total number of data: ", df.shape[0]*df.shape[1])
print("The total number of null values: {} and it occupies {:.2f}% of the total".format(df.isnull().sum().sum(), (df.isnull().sum().sum()*100)/(df.shape[0]*df.shape[1])))
print("The number of teams: ", df['franchID'].unique())
The total number of data: 143280 The total number of null values: 10064 and it occupies 7.02% of the total The number of teams: ['BNA' 'CNA' 'CFC' 'KEK' 'NNA' 'PNA' 'ROK' 'TRO' 'OLY' 'BLC' 'ECK' 'BRA' 'MAN' 'NAT' 'MAR' 'RES' 'PWS' 'WBL' 'HNA' 'WES' 'NHV' 'CEN' 'SLR' 'SNA' 'WNT' 'ATL' 'CHC' 'CNR' 'HAR' 'LGR' 'NYU' 'ATH' 'SBS' 'IBL' 'MLG' 'PRO' 'BUF' 'CBL' 'SYR' 'TRT' 'WOR' 'DTN' 'BLO' 'CIN' 'LOU' 'PHA' 'PIT' 'STL' 'CBK' 'SFG' 'NYP' 'PHI' 'ALT' 'BLU' 'LAD' 'BRD' 'CPI' 'COR' 'IHO' 'KCU' 'MLU' 'PHK' 'RIC' 'SLM' 'STP' 'TOL' 'WIL' 'WST' 'WNA' 'KCN' 'WNL' 'CLV' 'IND' 'KCC' 'CLS' 'BFB' 'BRG' 'BWW' 'BRS' 'CHP' 'CLI' 'NYI' 'PHQ' 'PBB' 'ROC' 'SYS' 'TLM' 'CKK' 'MLA' 'WAS' 'NYY' 'BOS' 'CHW' 'CLE' 'DET' 'BAL' 'OAK' 'MIN' 'BLT' 'BTT' 'BFL' 'CHH' 'NEW' 'KCP' 'PBS' 'SLI' 'ANA' 'TEX' 'HOU' 'NYM' 'KCR' 'WSN' 'SDP' 'MIL' 'SEA' 'TOR' 'COL' 'FLA' 'ARI' 'TBD']
Most of the teams that exist today have been essentially fixed since 1990, so I use the data from 1990 onward. Two seasons could add noise to this dataset: 1994, when a players' strike wiped out the playoffs, and 2020, the shortened Covid season. For that reason, I removed both seasons.
df = df[df['yearID'] >= 1990] # Select recent 30 years seasons.
df = df[df['yearID'] != 1994] # No playoff season caused by players' STRIKE
df = df[df['yearID'] != 2020] # Short season
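The three filters above can also be expressed as a single boolean mask with `isin()`; a minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Teams data.
toy = pd.DataFrame({"yearID": [1989, 1990, 1994, 2019, 2020, 2021]})

# Keep seasons from 1990 on, excluding the 1994 strike and the 2020 short season.
kept = toy[(toy["yearID"] >= 1990) & ~toy["yearID"].isin([1994, 2020])]
print(kept["yearID"].tolist())  # [1990, 2019, 2021]
```

Either style gives the same result; the combined mask just keeps the season rules in one place.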
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df.head(10)
| yearID | lgID | teamID | franchID | divID | Rank | G | Ghome | W | L | DivWin | WCWin | LgWin | WSWin | R | AB | H | 2B | 3B | HR | BB | SO | SB | CS | HBP | SF | RA | ER | ERA | CG | SHO | SV | IPouts | HA | HRA | BBA | SOA | E | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2047 | 1990 | NL | ATL | ATL | W | 6 | 162 | 81.0 | 65 | 97 | N | NaN | N | N | 682 | 5504 | 1376 | 263 | 26 | 162 | 473.0 | 1010.0 | 92.0 | 55.0 | 27.0 | 31.0 | 821 | 727 | 4.58 | 17 | 8 | 30 | 4289 | 1527 | 128 | 579 | 938 | 158 | 133 | 0.974 | Atlanta Braves | Atlanta-Fulton County Stadium | 980129.0 | 105 | 106 | ATL | ATL | ATL |
| 2048 | 1990 | AL | BAL | BAL | E | 5 | 161 | 80.0 | 76 | 85 | N | NaN | N | N | 669 | 5410 | 1328 | 234 | 22 | 132 | 660.0 | 962.0 | 94.0 | 52.0 | 40.0 | 41.0 | 698 | 644 | 4.04 | 10 | 5 | 43 | 4306 | 1445 | 161 | 537 | 776 | 93 | 151 | 0.985 | Baltimore Orioles | Memorial Stadium | 2415189.0 | 97 | 98 | BAL | BAL | BAL |
| 2049 | 1990 | AL | BOS | BOS | E | 1 | 162 | 81.0 | 88 | 74 | Y | NaN | N | N | 699 | 5516 | 1502 | 298 | 31 | 106 | 598.0 | 795.0 | 53.0 | 52.0 | 28.0 | 44.0 | 664 | 596 | 3.72 | 15 | 13 | 44 | 4326 | 1439 | 92 | 519 | 997 | 123 | 154 | 0.980 | Boston Red Sox | Fenway Park II | 2528986.0 | 105 | 105 | BOS | BOS | BOS |
| 2050 | 1990 | AL | CAL | ANA | W | 4 | 162 | 81.0 | 80 | 82 | N | NaN | N | N | 690 | 5570 | 1448 | 237 | 27 | 147 | 566.0 | 1000.0 | 69.0 | 43.0 | 28.0 | 45.0 | 706 | 613 | 3.79 | 21 | 13 | 42 | 4362 | 1482 | 106 | 544 | 944 | 142 | 186 | 0.978 | California Angels | Anaheim Stadium | 2555688.0 | 97 | 97 | CAL | CAL | CAL |
| 2051 | 1990 | AL | CHA | CHW | W | 2 | 162 | 80.0 | 94 | 68 | N | NaN | N | N | 682 | 5402 | 1393 | 251 | 44 | 106 | 478.0 | 903.0 | 140.0 | 90.0 | 36.0 | 47.0 | 633 | 581 | 3.61 | 17 | 10 | 68 | 4348 | 1313 | 106 | 548 | 914 | 124 | 169 | 0.980 | Chicago White Sox | Comiskey Park | 2002357.0 | 98 | 98 | CHW | CHA | CHA |
| 2052 | 1990 | NL | CHN | CHC | E | 4 | 162 | 81.0 | 77 | 85 | N | NaN | N | N | 690 | 5600 | 1474 | 240 | 36 | 136 | 406.0 | 869.0 | 151.0 | 50.0 | 30.0 | 51.0 | 774 | 695 | 4.34 | 13 | 7 | 42 | 4328 | 1510 | 121 | 572 | 877 | 124 | 136 | 0.980 | Chicago Cubs | Wrigley Field | 2243791.0 | 108 | 108 | CHC | CHN | CHN |
| 2053 | 1990 | NL | CIN | CIN | W | 1 | 162 | 81.0 | 91 | 71 | Y | NaN | Y | Y | 693 | 5525 | 1466 | 284 | 40 | 125 | 466.0 | 913.0 | 166.0 | 66.0 | 42.0 | 42.0 | 597 | 549 | 3.39 | 14 | 12 | 50 | 4369 | 1338 | 124 | 543 | 1029 | 102 | 126 | 0.983 | Cincinnati Reds | Riverfront Stadium | 2400892.0 | 105 | 105 | CIN | CIN | CIN |
| 2054 | 1990 | AL | CLE | CLE | E | 4 | 162 | 81.0 | 77 | 85 | N | NaN | N | N | 732 | 5485 | 1465 | 266 | 41 | 110 | 458.0 | 836.0 | 107.0 | 52.0 | 29.0 | 61.0 | 737 | 676 | 4.26 | 12 | 10 | 47 | 4282 | 1491 | 163 | 518 | 860 | 117 | 146 | 0.981 | Cleveland Indians | Cleveland Stadium | 1225240.0 | 100 | 100 | CLE | CLE | CLE |
| 2055 | 1990 | AL | DET | DET | E | 3 | 162 | 81.0 | 79 | 83 | N | NaN | N | N | 750 | 5479 | 1418 | 241 | 32 | 172 | 634.0 | 952.0 | 82.0 | 57.0 | 34.0 | 41.0 | 754 | 697 | 4.39 | 15 | 12 | 45 | 4291 | 1401 | 154 | 661 | 856 | 131 | 178 | 0.979 | Detroit Tigers | Tiger Stadium | 1495785.0 | 101 | 102 | DET | DET | DET |
| 2056 | 1990 | NL | HOU | HOU | W | 4 | 162 | 81.0 | 75 | 87 | N | NaN | N | N | 573 | 5379 | 1301 | 209 | 32 | 94 | 548.0 | 997.0 | 179.0 | 83.0 | 28.0 | 41.0 | 656 | 581 | 3.61 | 12 | 6 | 37 | 4350 | 1396 | 130 | 496 | 854 | 131 | 124 | 0.978 | Houston Astros | Astrodome | 1310927.0 | 97 | 98 | HOU | HOU | HOU |
#reset the index.
df = df.reset_index(drop=True)
pd.set_option('display.max_columns', None)
df.head(5)
df.isnull().sum()
yearID 0 lgID 0 teamID 0 franchID 0 divID 0 Rank 0 G 0 Ghome 0 W 0 L 0 DivWin 0 WCWin 106 LgWin 0 WSWin 0 R 0 AB 0 H 0 2B 0 3B 0 HR 0 BB 0 SO 0 SB 0 CS 0 HBP 0 SF 0 RA 0 ER 0 ERA 0 CG 0 SHO 0 SV 0 IPouts 0 HA 0 HRA 0 BBA 0 SOA 0 E 0 DP 0 FP 0 name 0 park 0 attendance 0 BPF 0 PPF 0 teamIDBR 0 teamIDlahman45 0 teamIDretro 0 dtype: int64
Checking the missing values in the filtered dataset shows that only WCWin has any. In those seasons, a team made the playoffs by winning its division, so I changed the missing WCWin values to 'N'.
# Seasons before the wild-card era have no WCWin value
df["WCWin"].fillna("N", inplace = True)
df["WCWin"].head(120)
0 N 1 N 2 N 3 N 4 N 5 N 6 N 7 N 8 N 9 N 10 N 11 N 12 N 13 N 14 N 15 N 16 N 17 N 18 N 19 N 20 N 21 N 22 N 23 N 24 N 25 N 26 N 27 N 28 N 29 N 30 N 31 N 32 N 33 N 34 N 35 N 36 N 37 N 38 N 39 N 40 N 41 N 42 N 43 N 44 N 45 N 46 N 47 N 48 N 49 N 50 N 51 N 52 N 53 N 54 N 55 N 56 N 57 N 58 N 59 N 60 N 61 N 62 N 63 N 64 N 65 N 66 N 67 N 68 N 69 N 70 N 71 N 72 N 73 N 74 N 75 N 76 N 77 N 78 N 79 N 80 N 81 N 82 N 83 N 84 N 85 N 86 N 87 N 88 N 89 N 90 N 91 N 92 N 93 N 94 N 95 N 96 N 97 N 98 N 99 N 100 N 101 N 102 N 103 N 104 N 105 N 106 N 107 N 108 N 109 N 110 N 111 N 112 N 113 N 114 Y 115 N 116 N 117 N 118 N 119 N Name: WCWin, dtype: object
As some teams have changed their franchID over time, I checked for teams that played in the same location under different identifiers and mapped them to the LA Angels, Chicago White Sox, Miami Marlins, and Tampa Bay Rays.
# Check the duplicated values and erase one row, going to use franchID and change the name (refers to MLB Team abbreviations)
print(df['franchID'].unique())
df['franchID'] = df['franchID'].replace({'ANA' : 'LAA', 'CHW' : 'CWS', 'FLA' : 'MIA', 'TBD' : 'TBR'})
# To check that the franchID has been replaced, then franchID only have 30 teams.
print(df['franchID'].unique())
['ATL' 'BAL' 'BOS' 'ANA' 'CHW' 'CHC' 'CIN' 'CLE' 'DET' 'HOU' 'KCR' 'LAD' 'MIN' 'MIL' 'WSN' 'NYY' 'NYM' 'OAK' 'PHI' 'PIT' 'SDP' 'SEA' 'SFG' 'STL' 'TEX' 'TOR' 'COL' 'FLA' 'ARI' 'TBD'] ['ATL' 'BAL' 'BOS' 'LAA' 'CWS' 'CHC' 'CIN' 'CLE' 'DET' 'HOU' 'KCR' 'LAD' 'MIN' 'MIL' 'WSN' 'NYY' 'NYM' 'OAK' 'PHI' 'PIT' 'SDP' 'SEA' 'SFG' 'STL' 'TEX' 'TOR' 'COL' 'MIA' 'ARI' 'TBR']
The new features are defined as follows.

Batting:
- 1B: Singles (hits minus doubles, triples, and home runs)
- BA: Batting Average (the ratio of hits to at-bats)
- OBP: On-Base Percentage (the share of plate appearances in which the batter reaches base)
- SLG: Slugging Percentage (a measure of the team's power-hitting ability, calculated as total bases divided by at-bats)
- TB: Total Bases (the sum of bases earned through singles, doubles, triples, and home runs)
- OPS: On-Base Plus Slugging (the sum of OBP and SLG)
- GPA: Gross Production Average (overall offensive production, computed here as (1.8·OBP + SLG) / 4)
- TA: Total Average (total bases, walks, hit-by-pitch, and stolen bases per out made)
- PSN: Power-Speed Number (a combined measure of home run and stolen base proficiency, 2·HR·SB / (HR + SB))
- ISO: Isolated Power (a measure of a team's raw power, calculated as SLG minus BA)
- BABIP: Batting Average on Balls in Play (hits excluding home runs divided by at-bats that end in play, (H − HR) / (AB − SO − HR + SF))

Pitching:
- WHIP: Walks plus Hits per Inning Pitched (a measure of a pitcher's effectiveness at preventing baserunners)
- BAA: Batting Average Against (the opposing batters' batting average against the team's pitchers)
- K/BB: Strikeouts per Walk (the ratio of strikeouts to walks, indicating a pitcher's control and dominance)
- BB/HBP_ratio: Walks and Hit-by-Pitch Ratio (walks plus hit-by-pitch divided by total at-bats)
- IP: Innings Pitched (the total innings pitched by the team's pitchers, converted from outs as IPouts / 3)

Fielding:
- FP: Fielding Percentage

Team:
- P%: Pythagorean expectation (runs scored squared divided by the sum of runs scored squared and runs allowed squared, estimating team strength from run differential)
- WP: Winning Percentage (the ratio of wins to total games played, indicating the team's winning rate)
#Add WP, BA, 1B, OBP, SLG, OPS, IP(IPouts/3), WHIP , TB, GPA, TA, PSN, ISO, BABIP, P%
#Add BAA(Batting Average Against), (BB/HBP_ratio = walk and hit by pitch ratio) , K/BB
df["WP"] = round(df["W"]/df["G"],3)
df["P%"] = round((df["R"]**2)/(df["R"]**2+df["RA"]**2),2)
df["BA"] = round(df["H"]/df["AB"],3)
df["1B"] = df["H"] - df["HR"] - df["3B"] - df["2B"]
df["OBP"] = round((df["H"] + df["BB"] + df["HBP"] + df["SF"])/(df["AB"] + df["BB"] + df["HBP"] + df["SF"]),3)
df["SLG"] = round((df["1B"] + 2*df["2B"] + 3*df["3B"] + 4*df["HR"])/df["AB"],3)
df["TB"] = df["1B"] + 2*df["2B"] +3*df["3B"] + 4*df["HR"]
df["OPS"] = round(df["OBP"] + df["SLG"],3)
df["GPA"] = round((1.8*df["OBP"]+df["SLG"])/4,3)
df["TA"] = round((df["TB"]+df["HBP"]+df["BB"]+df["SB"])/(df["AB"]-df["H"]+df["CS"]+df["DP"]),3)
df["PSN"] = round((df["HR"]*df["SB"]*2)/(df["HR"]+df["SB"]),3)
df["IP"] = round(df["IPouts"]/3,2)
df["WHIP"] = round((df["HA"] + df["BBA"])/(df["IP"]),3)
df["BAA"] = round(df["HA"]/(df["HA"]+df["IP"]),3)  # rough proxy; exact BAA would divide hits allowed by opponent at-bats
df["K/BB"] = round(df["SO"]/df["BB"],3)
df["ISO"] = df["SLG"]-df["BA"]
df["BB/HBP_ratio"] = round((df["BB"] + df["HBP"])/df["AB"],3)
df["BABIP"] = round((df["H"]-df["HR"])/(df["AB"]-df["SO"]-df["HR"]+df["SF"]),3)
df.head(10)
| yearID | lgID | teamID | franchID | divID | Rank | G | Ghome | W | L | DivWin | WCWin | LgWin | WSWin | R | AB | H | 2B | 3B | HR | BB | SO | SB | CS | HBP | SF | RA | ER | ERA | CG | SHO | SV | IPouts | HA | HRA | BBA | SOA | E | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro | WP | P% | BA | 1B | OBP | SLG | TB | OPS | GPA | TA | PSN | IP | WHIP | BAA | K/BB | ISO | BB/HBP_ratio | BABIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990 | NL | ATL | ATL | W | 6 | 162 | 81.0 | 65 | 97 | N | N | N | N | 682 | 5504 | 1376 | 263 | 26 | 162 | 473.0 | 1010.0 | 92.0 | 55.0 | 27.0 | 31.0 | 821 | 727 | 4.58 | 17 | 8 | 30 | 4289 | 1527 | 128 | 579 | 938 | 158 | 133 | 0.974 | Atlanta Braves | Atlanta-Fulton County Stadium | 980129.0 | 105 | 106 | ATL | ATL | ATL | 0.401 | 0.41 | 0.250 | 925 | 0.316 | 0.396 | 2177 | 0.712 | 0.241 | 0.642 | 117.354 | 1429.67 | 1.473 | 0.516 | 2.135 | 0.146 | 0.091 | 0.278 |
| 1 | 1990 | AL | BAL | BAL | E | 5 | 161 | 80.0 | 76 | 85 | N | N | N | N | 669 | 5410 | 1328 | 234 | 22 | 132 | 660.0 | 962.0 | 94.0 | 52.0 | 40.0 | 41.0 | 698 | 644 | 4.04 | 10 | 5 | 43 | 4306 | 1445 | 161 | 537 | 776 | 93 | 151 | 0.985 | Baltimore Orioles | Memorial Stadium | 2415189.0 | 97 | 98 | BAL | BAL | BAL | 0.472 | 0.48 | 0.245 | 940 | 0.336 | 0.370 | 2002 | 0.706 | 0.244 | 0.653 | 109.805 | 1435.33 | 1.381 | 0.502 | 1.458 | 0.125 | 0.129 | 0.275 |
| 2 | 1990 | AL | BOS | BOS | E | 1 | 162 | 81.0 | 88 | 74 | Y | N | N | N | 699 | 5516 | 1502 | 298 | 31 | 106 | 598.0 | 795.0 | 53.0 | 52.0 | 28.0 | 44.0 | 664 | 596 | 3.72 | 15 | 13 | 44 | 4326 | 1439 | 92 | 519 | 997 | 123 | 154 | 0.980 | Boston Red Sox | Fenway Park II | 2528986.0 | 105 | 105 | BOS | BOS | BOS | 0.543 | 0.53 | 0.272 | 1067 | 0.351 | 0.395 | 2180 | 0.746 | 0.257 | 0.677 | 70.667 | 1442.00 | 1.358 | 0.499 | 1.329 | 0.123 | 0.113 | 0.300 |
| 3 | 1990 | AL | CAL | LAA | W | 4 | 162 | 81.0 | 80 | 82 | N | N | N | N | 690 | 5570 | 1448 | 237 | 27 | 147 | 566.0 | 1000.0 | 69.0 | 43.0 | 28.0 | 45.0 | 706 | 613 | 3.79 | 21 | 13 | 42 | 4362 | 1482 | 106 | 544 | 944 | 142 | 186 | 0.978 | California Angels | Anaheim Stadium | 2555688.0 | 97 | 97 | CAL | CAL | CAL | 0.494 | 0.49 | 0.260 | 1037 | 0.336 | 0.391 | 2180 | 0.727 | 0.249 | 0.653 | 93.917 | 1454.00 | 1.393 | 0.505 | 1.767 | 0.131 | 0.107 | 0.291 |
| 4 | 1990 | AL | CHA | CWS | W | 2 | 162 | 80.0 | 94 | 68 | N | N | N | N | 682 | 5402 | 1393 | 251 | 44 | 106 | 478.0 | 903.0 | 140.0 | 90.0 | 36.0 | 47.0 | 633 | 581 | 3.61 | 17 | 10 | 68 | 4348 | 1313 | 106 | 548 | 914 | 124 | 169 | 0.980 | Chicago White Sox | Comiskey Park | 2002357.0 | 98 | 98 | CHW | CHA | CHA | 0.580 | 0.54 | 0.258 | 992 | 0.328 | 0.379 | 2050 | 0.707 | 0.242 | 0.634 | 120.650 | 1449.33 | 1.284 | 0.475 | 1.889 | 0.121 | 0.095 | 0.290 |
| 5 | 1990 | NL | CHN | CHC | E | 4 | 162 | 81.0 | 77 | 85 | N | N | N | N | 690 | 5600 | 1474 | 240 | 36 | 136 | 406.0 | 869.0 | 151.0 | 50.0 | 30.0 | 51.0 | 774 | 695 | 4.34 | 13 | 7 | 42 | 4328 | 1510 | 121 | 572 | 877 | 124 | 136 | 0.980 | Chicago Cubs | Wrigley Field | 2243791.0 | 108 | 108 | CHC | CHN | CHN | 0.475 | 0.44 | 0.263 | 1062 | 0.322 | 0.392 | 2194 | 0.714 | 0.243 | 0.645 | 143.108 | 1442.67 | 1.443 | 0.511 | 2.140 | 0.129 | 0.078 | 0.288 |
| 6 | 1990 | NL | CIN | CIN | W | 1 | 162 | 81.0 | 91 | 71 | Y | N | Y | Y | 693 | 5525 | 1466 | 284 | 40 | 125 | 466.0 | 913.0 | 166.0 | 66.0 | 42.0 | 42.0 | 597 | 549 | 3.39 | 14 | 12 | 50 | 4369 | 1338 | 124 | 543 | 1029 | 102 | 126 | 0.983 | Cincinnati Reds | Riverfront Stadium | 2400892.0 | 105 | 105 | CIN | CIN | CIN | 0.562 | 0.57 | 0.265 | 1017 | 0.332 | 0.399 | 2205 | 0.731 | 0.249 | 0.677 | 142.612 | 1456.33 | 1.292 | 0.479 | 1.959 | 0.134 | 0.092 | 0.296 |
| 7 | 1990 | AL | CLE | CLE | E | 4 | 162 | 81.0 | 77 | 85 | N | N | N | N | 732 | 5485 | 1465 | 266 | 41 | 110 | 458.0 | 836.0 | 107.0 | 52.0 | 29.0 | 61.0 | 737 | 676 | 4.26 | 12 | 10 | 47 | 4282 | 1491 | 163 | 518 | 860 | 117 | 146 | 0.981 | Cleveland Indians | Cleveland Stadium | 1225240.0 | 100 | 100 | CLE | CLE | CLE | 0.475 | 0.50 | 0.267 | 1048 | 0.334 | 0.391 | 2143 | 0.725 | 0.248 | 0.649 | 108.479 | 1427.33 | 1.408 | 0.511 | 1.825 | 0.124 | 0.089 | 0.295 |
| 8 | 1990 | AL | DET | DET | E | 3 | 162 | 81.0 | 79 | 83 | N | N | N | N | 750 | 5479 | 1418 | 241 | 32 | 172 | 634.0 | 952.0 | 82.0 | 57.0 | 34.0 | 41.0 | 754 | 697 | 4.39 | 15 | 12 | 45 | 4291 | 1401 | 154 | 661 | 856 | 131 | 178 | 0.979 | Detroit Tigers | Tiger Stadium | 1495785.0 | 101 | 102 | DET | DET | DET | 0.488 | 0.50 | 0.259 | 973 | 0.344 | 0.409 | 2239 | 0.753 | 0.257 | 0.696 | 111.055 | 1430.33 | 1.442 | 0.495 | 1.502 | 0.150 | 0.122 | 0.283 |
| 9 | 1990 | NL | HOU | HOU | W | 4 | 162 | 81.0 | 75 | 87 | N | N | N | N | 573 | 5379 | 1301 | 209 | 32 | 94 | 548.0 | 997.0 | 179.0 | 83.0 | 28.0 | 41.0 | 656 | 581 | 3.61 | 12 | 6 | 37 | 4350 | 1396 | 130 | 496 | 854 | 131 | 124 | 0.978 | Houston Astros | Astrodome | 1310927.0 | 97 | 98 | HOU | HOU | HOU | 0.463 | 0.43 | 0.242 | 966 | 0.320 | 0.345 | 1856 | 0.665 | 0.230 | 0.609 | 123.267 | 1450.00 | 1.305 | 0.491 | 1.819 | 0.103 | 0.107 | 0.279 |
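As a sanity check, a few of the derived stats can be recomputed by hand from the 1990 Atlanta Braves row above (R=682, RA=821, H=1376, AB=5504, 2B=263, 3B=26, HR=162):

```python
# Raw counts copied from the first row of the table above.
R, RA = 682, 821
H, AB, D2, D3, HR = 1376, 5504, 263, 26, 162

pyth = round(R**2 / (R**2 + RA**2), 2)               # "P%": Pythagorean expectation
ba = round(H / AB, 3)                                # batting average
singles = H - HR - D3 - D2                           # 1B
slg = round((singles + 2*D2 + 3*D3 + 4*HR) / AB, 3)  # slugging percentage

print(pyth, ba, singles, slg)  # 0.41 0.25 925 0.396
```

The results match the P%, BA, 1B, and SLG values shown for that row, confirming the feature formulas.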
# Check which teams qualified for the playoffs
df["make_playoffs_rank_1"] = df["Rank"]==1
df["make_playoffs_wild_card"] = df["WCWin"]=="Y"
df["make_playoffs_win_division"] = df["DivWin"]=="Y"
df["make_playoffs"] = df["make_playoffs_rank_1"] | df["make_playoffs_wild_card"] | df["make_playoffs_win_division"]
df = pd.get_dummies(df, columns = ["make_playoffs"], drop_first = True)
df.head(10)
| yearID | lgID | teamID | franchID | divID | Rank | G | Ghome | W | L | DivWin | WCWin | LgWin | WSWin | R | AB | H | 2B | 3B | HR | BB | SO | SB | CS | HBP | SF | RA | ER | ERA | CG | SHO | SV | IPouts | HA | HRA | BBA | SOA | E | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro | WP | P% | BA | 1B | OBP | SLG | TB | OPS | GPA | TA | PSN | IP | WHIP | BAA | K/BB | ISO | BB/HBP_ratio | BABIP | make_playoffs_rank_1 | make_playoffs_wild_card | make_playoffs_win_division | make_playoffs_True | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990 | NL | ATL | ATL | W | 6 | 162 | 81.0 | 65 | 97 | N | N | N | N | 682 | 5504 | 1376 | 263 | 26 | 162 | 473.0 | 1010.0 | 92.0 | 55.0 | 27.0 | 31.0 | 821 | 727 | 4.58 | 17 | 8 | 30 | 4289 | 1527 | 128 | 579 | 938 | 158 | 133 | 0.974 | Atlanta Braves | Atlanta-Fulton County Stadium | 980129.0 | 105 | 106 | ATL | ATL | ATL | 0.401 | 0.41 | 0.250 | 925 | 0.316 | 0.396 | 2177 | 0.712 | 0.241 | 0.642 | 117.354 | 1429.67 | 1.473 | 0.516 | 2.135 | 0.146 | 0.091 | 0.278 | False | False | False | 0 |
| 1 | 1990 | AL | BAL | BAL | E | 5 | 161 | 80.0 | 76 | 85 | N | N | N | N | 669 | 5410 | 1328 | 234 | 22 | 132 | 660.0 | 962.0 | 94.0 | 52.0 | 40.0 | 41.0 | 698 | 644 | 4.04 | 10 | 5 | 43 | 4306 | 1445 | 161 | 537 | 776 | 93 | 151 | 0.985 | Baltimore Orioles | Memorial Stadium | 2415189.0 | 97 | 98 | BAL | BAL | BAL | 0.472 | 0.48 | 0.245 | 940 | 0.336 | 0.370 | 2002 | 0.706 | 0.244 | 0.653 | 109.805 | 1435.33 | 1.381 | 0.502 | 1.458 | 0.125 | 0.129 | 0.275 | False | False | False | 0 |
| 2 | 1990 | AL | BOS | BOS | E | 1 | 162 | 81.0 | 88 | 74 | Y | N | N | N | 699 | 5516 | 1502 | 298 | 31 | 106 | 598.0 | 795.0 | 53.0 | 52.0 | 28.0 | 44.0 | 664 | 596 | 3.72 | 15 | 13 | 44 | 4326 | 1439 | 92 | 519 | 997 | 123 | 154 | 0.980 | Boston Red Sox | Fenway Park II | 2528986.0 | 105 | 105 | BOS | BOS | BOS | 0.543 | 0.53 | 0.272 | 1067 | 0.351 | 0.395 | 2180 | 0.746 | 0.257 | 0.677 | 70.667 | 1442.00 | 1.358 | 0.499 | 1.329 | 0.123 | 0.113 | 0.300 | True | False | True | 1 |
| 3 | 1990 | AL | CAL | LAA | W | 4 | 162 | 81.0 | 80 | 82 | N | N | N | N | 690 | 5570 | 1448 | 237 | 27 | 147 | 566.0 | 1000.0 | 69.0 | 43.0 | 28.0 | 45.0 | 706 | 613 | 3.79 | 21 | 13 | 42 | 4362 | 1482 | 106 | 544 | 944 | 142 | 186 | 0.978 | California Angels | Anaheim Stadium | 2555688.0 | 97 | 97 | CAL | CAL | CAL | 0.494 | 0.49 | 0.260 | 1037 | 0.336 | 0.391 | 2180 | 0.727 | 0.249 | 0.653 | 93.917 | 1454.00 | 1.393 | 0.505 | 1.767 | 0.131 | 0.107 | 0.291 | False | False | False | 0 |
| 4 | 1990 | AL | CHA | CWS | W | 2 | 162 | 80.0 | 94 | 68 | N | N | N | N | 682 | 5402 | 1393 | 251 | 44 | 106 | 478.0 | 903.0 | 140.0 | 90.0 | 36.0 | 47.0 | 633 | 581 | 3.61 | 17 | 10 | 68 | 4348 | 1313 | 106 | 548 | 914 | 124 | 169 | 0.980 | Chicago White Sox | Comiskey Park | 2002357.0 | 98 | 98 | CHW | CHA | CHA | 0.580 | 0.54 | 0.258 | 992 | 0.328 | 0.379 | 2050 | 0.707 | 0.242 | 0.634 | 120.650 | 1449.33 | 1.284 | 0.475 | 1.889 | 0.121 | 0.095 | 0.290 | False | False | False | 0 |
| 5 | 1990 | NL | CHN | CHC | E | 4 | 162 | 81.0 | 77 | 85 | N | N | N | N | 690 | 5600 | 1474 | 240 | 36 | 136 | 406.0 | 869.0 | 151.0 | 50.0 | 30.0 | 51.0 | 774 | 695 | 4.34 | 13 | 7 | 42 | 4328 | 1510 | 121 | 572 | 877 | 124 | 136 | 0.980 | Chicago Cubs | Wrigley Field | 2243791.0 | 108 | 108 | CHC | CHN | CHN | 0.475 | 0.44 | 0.263 | 1062 | 0.322 | 0.392 | 2194 | 0.714 | 0.243 | 0.645 | 143.108 | 1442.67 | 1.443 | 0.511 | 2.140 | 0.129 | 0.078 | 0.288 | False | False | False | 0 |
| 6 | 1990 | NL | CIN | CIN | W | 1 | 162 | 81.0 | 91 | 71 | Y | N | Y | Y | 693 | 5525 | 1466 | 284 | 40 | 125 | 466.0 | 913.0 | 166.0 | 66.0 | 42.0 | 42.0 | 597 | 549 | 3.39 | 14 | 12 | 50 | 4369 | 1338 | 124 | 543 | 1029 | 102 | 126 | 0.983 | Cincinnati Reds | Riverfront Stadium | 2400892.0 | 105 | 105 | CIN | CIN | CIN | 0.562 | 0.57 | 0.265 | 1017 | 0.332 | 0.399 | 2205 | 0.731 | 0.249 | 0.677 | 142.612 | 1456.33 | 1.292 | 0.479 | 1.959 | 0.134 | 0.092 | 0.296 | True | False | True | 1 |
| 7 | 1990 | AL | CLE | CLE | E | 4 | 162 | 81.0 | 77 | 85 | N | N | N | N | 732 | 5485 | 1465 | 266 | 41 | 110 | 458.0 | 836.0 | 107.0 | 52.0 | 29.0 | 61.0 | 737 | 676 | 4.26 | 12 | 10 | 47 | 4282 | 1491 | 163 | 518 | 860 | 117 | 146 | 0.981 | Cleveland Indians | Cleveland Stadium | 1225240.0 | 100 | 100 | CLE | CLE | CLE | 0.475 | 0.50 | 0.267 | 1048 | 0.334 | 0.391 | 2143 | 0.725 | 0.248 | 0.649 | 108.479 | 1427.33 | 1.408 | 0.511 | 1.825 | 0.124 | 0.089 | 0.295 | False | False | False | 0 |
| 8 | 1990 | AL | DET | DET | E | 3 | 162 | 81.0 | 79 | 83 | N | N | N | N | 750 | 5479 | 1418 | 241 | 32 | 172 | 634.0 | 952.0 | 82.0 | 57.0 | 34.0 | 41.0 | 754 | 697 | 4.39 | 15 | 12 | 45 | 4291 | 1401 | 154 | 661 | 856 | 131 | 178 | 0.979 | Detroit Tigers | Tiger Stadium | 1495785.0 | 101 | 102 | DET | DET | DET | 0.488 | 0.50 | 0.259 | 973 | 0.344 | 0.409 | 2239 | 0.753 | 0.257 | 0.696 | 111.055 | 1430.33 | 1.442 | 0.495 | 1.502 | 0.150 | 0.122 | 0.283 | False | False | False | 0 |
| 9 | 1990 | NL | HOU | HOU | W | 4 | 162 | 81.0 | 75 | 87 | N | N | N | N | 573 | 5379 | 1301 | 209 | 32 | 94 | 548.0 | 997.0 | 179.0 | 83.0 | 28.0 | 41.0 | 656 | 581 | 3.61 | 12 | 6 | 37 | 4350 | 1396 | 130 | 496 | 854 | 131 | 124 | 0.978 | Houston Astros | Astrodome | 1310927.0 | 97 | 98 | HOU | HOU | HOU | 0.463 | 0.43 | 0.242 | 966 | 0.320 | 0.345 | 1856 | 0.665 | 0.230 | 0.609 | 123.267 | 1450.00 | 1.305 | 0.491 | 1.819 | 0.103 | 0.107 | 0.279 | False | False | False | 0 |
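Side note: because make_playoffs is a single boolean column, pd.get_dummies with drop_first=True (which drops the False category and keeps a make_playoffs_True column) is equivalent to casting the column to int. A minimal sketch with hypothetical rows:

```python
import pandas as pd

toy = pd.DataFrame({"make_playoffs": [True, False, True]})

# get_dummies drops the False category, leaving only make_playoffs_True...
dummies = pd.get_dummies(toy, columns=["make_playoffs"], drop_first=True)
print(dummies["make_playoffs_True"].astype(int).tolist())  # [1, 0, 1]

# ...which carries the same information as a plain integer cast.
print(toy["make_playoffs"].astype(int).tolist())           # [1, 0, 1]
```

Either approach yields a usable 0/1 target for the classifiers.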
df.groupby("yearID")["make_playoffs_True"].value_counts()
yearID make_playoffs_True
1990 0 22
1 4
1991 0 22
1 4
1992 0 22
1 4
1993 0 24
1 4
1995 0 20
1 8
1996 0 20
1 8
1997 0 20
1 8
1998 0 22
1 8
1999 0 22
1 8
2000 0 22
1 8
2001 0 22
1 8
2002 0 22
1 8
2003 0 22
1 8
2004 0 22
1 8
2005 0 22
1 8
2006 0 22
1 8
2007 0 22
1 8
2008 0 22
1 8
2009 0 22
1 8
2010 0 22
1 8
2011 0 22
1 8
2012 0 20
1 10
2013 0 20
1 10
2014 0 20
1 10
2015 0 20
1 10
2016 0 20
1 10
2017 0 20
1 10
2018 0 20
1 10
2019 0 20
1 10
2021 0 20
1 10
Name: make_playoffs_True, dtype: int64
In the visualization part, I first created a correlation matrix/heatmap of all the numeric variables in the data frame, which makes the pairwise correlations easy to read. If two variables are highly correlated, one of them may need to be removed during model building to avoid multicollinearity. Since the new variables were built from existing columns (for example, OBP, OPS, and WHIP are derived from counting stats such as hits, home runs, and walks), some high correlations are expected, and the heatmap helps me decide which variables to keep for the modeling step.
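As a supplementary sketch (not from the original notebook; `high_corr_pairs` is a hypothetical helper), the highly correlated pairs can be listed programmatically before deciding which columns to drop:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds `threshold`."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```

Calling `high_corr_pairs(df, 0.9)` on the baseball data frame would list the pairs (such as OBP/OPS) that drive the dark cells in the heatmap.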
#make a Correlation map of all the columns in the data set
plt.subplots(figsize=(20,20))
plt.title("Correlation Matrix of Baseball data")
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, annot_kws = {'size':5}, cmap = "YlGnBu")
<Axes: title={'center': 'Correlation Matrix of Baseball data'}>
Since the data set has many columns, I split it into two separate heatmaps to keep them readable.
df.shape[1]
70
df1 = df.iloc[:, :37]
df2 = df.iloc[:, 37:]
plt.figure(figsize=(10, 10))
sns.heatmap(df1.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap - Part 1')
plt.show()
plt.figure(figsize=(10, 10))
sns.heatmap(df2.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap - Part 2')
plt.show()
The figure below shows which franchises qualified for the playoffs most and least often from 1990 to 2021. The New York Yankees appeared the most, with more than 20 appearances, and the Atlanta Braves were second with about 20. The Miami Marlins and Kansas City Royals appeared the least, with only 2 appearances each.
#Make a barplot to check who qualified for the playoffs in 1990-2021
playoff_qual = df[df["make_playoffs_True"]==1]["franchID"]
plt.subplots(figsize=(15,5))
plt.hist(playoff_qual, bins=100)
plt.title("Playoff Qualifiers franchises (1990-2021)")
Text(0.5, 1.0, 'Playoff Qualifiers franchises (1990-2021)')
The figures below show comparisons of most of the variables that I made during the data preprocessing step between teams that qualified for the playoffs and teams that could not qualify for the playoffs.
Winning Percentage summarizes each team's overall results for the 1990-2021 seasons. As expected, the teams that qualified for the playoffs have higher winning percentages; even the playoff team with the lowest winning percentage still beats the average of the teams that missed the playoffs. Power Percentage (P%) is computed from runs scored and runs allowed, so it approximates the relationship between the two, and it also shows a clear separation between playoff and non-playoff teams. Better teams tend to score more and allow fewer runs, and that translates directly into wins.
For Batting Average, there is not much difference between playoff and non-playoff teams. The gap is approximately 0.007, and the highest batting average among non-playoff teams is about the same as, or higher than, the highest among playoff teams. Next I compared the hitting statistics OBP, SLG, OPS, GPA, TA, BABIP, PSN, ISO, and BB/HBP_ratio. Most of them are slightly higher for playoff teams, but the gaps are smaller than I expected. TA (Total Average) and GPA (Gross Production Average), however, combine most of the hitting components and show a larger difference between the two groups. In other words, no single stat produces a dramatic gap, but the composite advanced statistics turn many small differences into a visible one.
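For reference, the composite hitting statistics discussed here follow the standard sabermetric formulas; a minimal sketch (helper names are my own, not from the notebook):

```python
def on_base_pct(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)"""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slugging_pct(singles, doubles, triples, hr, ab):
    """SLG = total bases / at-bats"""
    return (singles + 2 * doubles + 3 * triples + 4 * hr) / ab

def gross_production_avg(obp, slg):
    """GPA = (1.8 * OBP + SLG) / 4, weighting OBP more heavily than OPS does."""
    return (1.8 * obp + slg) / 4

# OPS is simply OBP + SLG, and ISO is SLG minus batting average.
```

The extra weight on OBP inside GPA is why GPA can separate the two groups even when OBP and SLG individually show only small gaps.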
There are also the pitching statistics: K/BB, BAA, ERA, and WHIP. K/BB shows an interesting result. Usually a better team has a higher K/BB ratio, but in these plots the non-playoff teams have the higher ratio, and their K/BB range is also wider. The other statistics (BAA, ERA, and WHIP) behave as expected: lower values indicate better pitching staffs.
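The pitching rates above also reduce to simple formulas; a small sketch (helper names are my own, not from the notebook):

```python
def era(earned_runs, innings_pitched):
    """ERA = 9 * earned runs / innings pitched (runs allowed per 9 innings)."""
    return 9 * earned_runs / innings_pitched

def whip(bb, hits_allowed, innings_pitched):
    """WHIP = (walks + hits allowed) / innings pitched (baserunners per inning)."""
    return (bb + hits_allowed) / innings_pitched

def k_bb_ratio(strikeouts, bb):
    """K/BB = strikeouts / walks issued by the pitching staff."""
    return strikeouts / bb
```

Because WHIP counts every baserunner rather than only the ones who score, it tends to be a steadier indicator of staff quality than ERA alone.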
Fielding Percentage measures how cleanly a team handles its defensive chances: successfully fielded balls (putouts and assists) divided by total opportunities (putouts, assists, and errors). It does not differ much between playoff and non-playoff teams, although the lower quartile for non-playoff teams is noticeably lower than that for playoff teams.
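The fielding percentage definition above translates directly into code; a one-function sketch (the helper name is my own):

```python
def fielding_pct(putouts, assists, errors):
    """FP = (PO + A) / (PO + A + E): the share of chances handled cleanly."""
    chances = putouts + assists + errors
    return (putouts + assists) / chances
```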
To summarize the visualizations: most batting statistics differ little between the teams that made the playoffs and the teams that did not, while the pitching statistics show clearer separation. This suggests that, at the team level, pitching matters more than hitting for making the playoffs.
import matplotlib.pyplot as plt
import seaborn as sns
# List of statistics to analyze
statistics = ['WP', 'P%', 'BA', 'OBP', 'SLG', 'OPS', 'GPA', 'TA', 'BABIP', 'PSN', 'ISO', 'BB/HBP_ratio',
'K/BB', 'BAA', 'ERA', 'WHIP',
'FP']
# Iterate through each statistic
for stat in statistics:
    # Calculate the average for playoff teams
    average_playoff_teams = df[df["make_playoffs_True"] == 1][stat].mean()
    # Calculate the average for non-playoff teams
    average_non_playoff_teams = df[df["make_playoffs_True"] == 0][stat].mean()
    # Print the results
    print(f"Average {stat} for playoff teams: {average_playoff_teams}")
    print(f"Average {stat} for non-playoff teams: {average_non_playoff_teams}")
    # Create a boxplot for the current statistic
    plt.figure(figsize=(8, 6))
    sns.boxplot(x="make_playoffs_True", y=stat, data=df)
    plt.xticks([0, 1], ["No Playoff", "Playoff"])
    plt.xlabel("Playoff Qualification")
    plt.ylabel(stat)
    plt.title(f"Comparison of {stat} between Playoff and Non-Playoff Teams")
    plt.show()
Average WP for playoff teams: 0.583107438016529
Average WP for non-playoff teams: 0.4683495297805642
Average P% for playoff teams: 0.5793388429752065
Average P% for non-playoff teams: 0.4706112852664577
Average BA for playoff teams: 0.26522727272727276
Average BA for non-playoff teams: 0.25834012539184953
Average OBP for playoff teams: 0.3448925619834711
Average OBP for non-playoff teams: 0.3322836990595612
Average SLG for playoff teams: 0.4299504132231405
Average SLG for non-playoff teams: 0.40727115987460816
Average OPS for playoff teams: 0.7748429752066116
Average OPS for non-playoff teams: 0.7395548589341693
Average GPA for playoff teams: 0.2626776859504132
Average GPA for non-playoff teams: 0.25135736677115983
Average TA for playoff teams: 0.731396694214876
Average TA for non-playoff teams: 0.6808275862068967
Average BABIP for playoff teams: 0.3000413223140495
Average BABIP for non-playoff teams: 0.29550313479623824
Average PSN for playoff teams: 122.8812479338843
Average PSN for non-playoff teams: 116.20618181818183
Average ISO for playoff teams: 0.16472314049586775
Average ISO for non-playoff teams: 0.14893103448275863
Average BB/HBP_ratio for playoff teams: 0.11323553719008266
Average BB/HBP_ratio for non-playoff teams: 0.10300156739811912
Average K/BB for playoff teams: 2.011099173553719
Average K/BB for non-playoff teams: 2.215307210031348
Average BAA for playoff teams: 0.4877148760330579
Average BAA for non-playoff teams: 0.5029639498432601
Average ERA for playoff teams: 3.8762396694214876
Average ERA for non-playoff teams: 4.39628526645768
Average WHIP for playoff teams: 1.2971983471074382
Average WHIP for non-playoff teams: 1.3913150470219435
Average FP for playoff teams: 0.9838595041322314
Average FP for non-playoff teams: 0.9822727272727274
#Make the plot to see OPS, GPA, TA, ISO, BABIP, and P% per Each Team
# Calculate the average of each metric for each team
team_batting_statistics = df.groupby(['franchID'])[['OPS','GPA','TA','ISO','BABIP','P%']].mean()
# Size the chart (creating the figure once avoids an empty stray figure)
plt.figure(figsize=(20, 10))
# Melt the DataFrame to reformat it for barplot
team_statistic_melted1 = team_batting_statistics.reset_index().melt(id_vars='franchID', var_name='Statistic', value_name='Value')
# Draw the barplot
ax = sns.barplot(data=team_statistic_melted1, x='franchID', y='Value', hue='Statistic', palette="pastel")
plt.xticks(rotation=50)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Average OPS, GPA, TA, ISO, BABIP, and P% per Team")
plt.xlabel("Team")
plt.ylabel("Statistics")
plt.show()
#Make the plot to see WHIP, BAA, K/BB, BB/HBP_ratio per Each Team
# Calculate the average of each metric for each team
team_pitching_statistics = df.groupby(['franchID'])[['WHIP','BAA','K/BB','BB/HBP_ratio']].mean()
# Size the chart (creating the figure once avoids an empty stray figure)
plt.figure(figsize=(20, 10))
# Melt the DataFrame to reformat it for barplot
team_statistic_melted2 = team_pitching_statistics.reset_index().melt(id_vars='franchID', var_name='Statistic', value_name='Value')
# Draw the barplot
ax = sns.barplot(data=team_statistic_melted2, x='franchID', y='Value', hue='Statistic', palette="pastel")
plt.xticks(rotation=50)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Average WHIP, BAA, K/BB, BB/HBP_ratio per Team")
plt.xlabel("Team")
plt.ylabel("Statistics")
plt.show()
Before building a model, I used the heatmap above to remove columns that are highly correlated, categorical, or unlikely to be meaningful for the model. After dropping them, the new data frame has 880 observations and 13 columns, or 11,440 values in total. A plain train-test split without resampling would still work, but since there are fewer observations than I expected, I used 10-fold cross-validation to make better use of the data. The variables OBP, SLG, and OPS created in the preprocessing step are all components of GPA, so I removed them as well. Finally, I separated out the target variable, make_playoffs_True.
#Removing unnecessary columns to do modeling.
columns_to_drop = ["yearID","G","lgID","franchID","divID","Ghome","W","L","AB","H","1B","2B","3B","BB",
"SB","CG","SHO","SV","IPouts","R","RA","ER","teamID","name","park",
"attendance","DivWin","WCWin","LgWin","WSWin","teamIDBR","teamIDlahman45","teamIDretro",
"make_playoffs_rank_1","make_playoffs_wild_card","make_playoffs_win_division","PPF","BPF",
"HA","HRA","BBA","SO","SOA","E","CS","SF","DP","BA","Rank","HR","HBP","TB","IP", "WP", "OBP", "SLG", "OPS"]
ndf = df.drop(columns =columns_to_drop, inplace = False)
ndf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 880 entries, 0 to 879
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ERA                 880 non-null    float64
 1   FP                  880 non-null    float64
 2   P%                  880 non-null    float64
 3   GPA                 880 non-null    float64
 4   TA                  880 non-null    float64
 5   PSN                 880 non-null    float64
 6   WHIP                880 non-null    float64
 7   BAA                 880 non-null    float64
 8   K/BB                880 non-null    float64
 9   ISO                 880 non-null    float64
 10  BB/HBP_ratio        880 non-null    float64
 11  BABIP               880 non-null    float64
 12  make_playoffs_True  880 non-null    uint8
dtypes: float64(12), uint8(1)
memory usage: 83.5 KB
#Make a heatmap of the remaining columns used to predict the playoffs
plt.subplots(figsize=(10,10))
plt.title("Correlation Matrix of Baseball data")
sns.heatmap(ndf.corr(), vmin=-1, vmax=1, annot=True, annot_kws = {'size':15}, cmap = "YlGnBu")
<Axes: title={'center': 'Correlation Matrix of Baseball data'}>
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Separate features (X) and target variable (y)
features = ['BB/HBP_ratio', 'K/BB', 'BAA', 'BABIP', 'PSN', 'TA', 'GPA', 'ERA', 'WHIP', 'FP', "ISO", "P%"]
X = ndf[features]
y = ndf["make_playoffs_True"]
# Initialize k-fold cross-validator
n_splits = 10
kf = KFold(n_splits=n_splits, shuffle = True, random_state=42)
# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.linear_model import LogisticRegression
# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []
# Perform k-fold cross-validation over the training set. The fold variables
# get their own names so that X_train is not overwritten mid-loop, and the
# fold indices are applied to X_train itself so they line up with the rows.
for train_index, test_index in kf.split(X_train):
    X_fold_train, X_fold_test = X_train.iloc[train_index], X_train.iloc[test_index]
    y_fold_train, y_fold_test = y_train.iloc[train_index], y_train.iloc[test_index]
    # Create and train the logistic regression model
    lr = LogisticRegression()
    lr.fit(X_fold_train, y_fold_train)
    # Make predictions on the held-out fold
    y_pred_lr = lr.predict(X_fold_test)
    # Evaluate the model's performance
    accuracy_lr = accuracy_score(y_fold_test, y_pred_lr)
    classification_rep_lr = classification_report(y_fold_test, y_pred_lr)
    conf_matrix = confusion_matrix(y_fold_test, y_pred_lr)
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_lr)
    classification_reports.append(classification_rep_lr)
    confusion_matrices.append(conf_matrix)
# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)
# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_lr)
print("Classification Report:")
print(classification_rep_lr)
print("Confusion Matrix:")
print(conf_matrix)
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Mean Accuracy: 0.8139034205231388
Accuracy: 0.8571428571428571
Classification Report:
precision recall f1-score support
0 0.89 0.92 0.91 53
1 0.73 0.65 0.69 17
accuracy 0.86 70
macro avg 0.81 0.79 0.80 70
weighted avg 0.85 0.86 0.85 70
Confusion Matrix:
[[49 4]
[ 6 11]]
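The fold loop above can also be written with scikit-learn's `cross_val_score`, which sidesteps reusing the train/test variable names across folds. Here is a hedged sketch on toy data (the real notebook would pass its own `X` and `y`); putting a scaler in the pipeline should also address the lbfgs ConvergenceWarning shown above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the notebook's X and y
X, y = make_classification(n_samples=200, n_features=12, random_state=42)

# Scaling before the logistic regression helps lbfgs converge
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print("Mean accuracy:", scores.mean())
```

The scaler is fit only on each fold's training portion, so this also avoids leaking test-fold statistics into training.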
# Figure out meaningful features.
from sklearn.feature_selection import RFE
lrrfe = LogisticRegression()
rfe = RFE(lrrfe, n_features_to_select=10)
fit = rfe.fit(X_train, y_train)
selected_features = fit.support_
print("Selected Features:", selected_features)
Selected Features: [ True True True True False True True True True False True True]
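To make the RFE result easier to read, the boolean support mask can be mapped back to feature names; a small sketch using the feature list and the mask printed above:

```python
# Map the RFE support mask back to the feature names defined earlier
features = ['BB/HBP_ratio', 'K/BB', 'BAA', 'BABIP', 'PSN', 'TA',
            'GPA', 'ERA', 'WHIP', 'FP', 'ISO', 'P%']
selected_features = [True, True, True, True, False, True,
                     True, True, True, False, True, True]

kept = [name for name, keep in zip(features, selected_features) if keep]
dropped = [name for name, keep in zip(features, selected_features) if not keep]
print("Kept:", kept)
print("Dropped:", dropped)  # PSN and FP were not selected
```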
from sklearn.ensemble import RandomForestClassifier
# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []
# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform k-fold cross-validation over the training set (fold variables get
# their own names, and the indices are applied to X_train itself)
for train_index, test_index in kf.split(X_train):
    X_fold_train, X_fold_test = X_train.iloc[train_index], X_train.iloc[test_index]
    y_fold_train, y_fold_test = y_train.iloc[train_index], y_train.iloc[test_index]
    # Create and train the Random Forest classifier model
    rfc = RandomForestClassifier()
    rfc.fit(X_fold_train, y_fold_train)
    # Make predictions on the held-out fold
    y_pred_rfc = rfc.predict(X_fold_test)
    # Evaluate the model's performance
    accuracy_rfc = accuracy_score(y_fold_test, y_pred_rfc)
    classification_rep_rfc = classification_report(y_fold_test, y_pred_rfc)
    conf_matrix = confusion_matrix(y_fold_test, y_pred_rfc)
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_rfc)
    classification_reports.append(classification_rep_rfc)
    confusion_matrices.append(conf_matrix)
# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)
# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Importance:", rfc.feature_importances_)
print("Accuracy:", accuracy_rfc)
print("Classification Report:")
print(classification_rep_rfc)
print("Confusion Matrix:")
print(conf_matrix)
Mean Accuracy: 0.8749698189134809
Importance: [0.04177033 0.04881799 0.0558487 0.04426561 0.04872195 0.10261489
0.07658676 0.07670012 0.10828458 0.02473012 0.06631842 0.30534053]
Accuracy: 0.9
Classification Report:
precision recall f1-score support
0 0.94 0.92 0.93 53
1 0.78 0.82 0.80 17
accuracy 0.90 70
macro avg 0.86 0.87 0.87 70
weighted avg 0.90 0.90 0.90 70
Confusion Matrix:
[[49 4]
[ 3 14]]
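The feature importances printed above are easier to interpret when paired with the feature names; a sketch using the values from this run:

```python
import pandas as pd

features = ['BB/HBP_ratio', 'K/BB', 'BAA', 'BABIP', 'PSN', 'TA',
            'GPA', 'ERA', 'WHIP', 'FP', 'ISO', 'P%']
# Importances copied from the Random Forest output above
importances = [0.04177033, 0.04881799, 0.0558487, 0.04426561, 0.04872195,
               0.10261489, 0.07658676, 0.07670012, 0.10828458, 0.02473012,
               0.06631842, 0.30534053]

ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked.head())  # P% dominates, followed by WHIP and TA
```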
from xgboost import XGBClassifier
# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []
# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform k-fold cross-validation over the training set
for train_index, test_index in kf.split(X_train):
    X_fold_train, X_fold_test = X_train.iloc[train_index], X_train.iloc[test_index]
    y_fold_train, y_fold_test = y_train.iloc[train_index], y_train.iloc[test_index]
    # Create and train the XGBoost classifier model
    xgb = XGBClassifier()
    xgb.fit(X_fold_train, y_fold_train)
    # Make predictions on the held-out fold
    y_pred_xgb = xgb.predict(X_fold_test)
    # Evaluate the model's performance (the report and the confusion matrix
    # go into separately named variables)
    accuracy_xgb = accuracy_score(y_fold_test, y_pred_xgb)
    classification_rep_xgb = classification_report(y_fold_test, y_pred_xgb)
    conf_matrix_xgb = confusion_matrix(y_fold_test, y_pred_xgb)
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_xgb)
    classification_reports.append(classification_rep_xgb)
    confusion_matrices.append(conf_matrix_xgb)
# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)
# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_xgb)
print("Classification Report:")
print(classification_rep_xgb)
print("Confusion Matrix:")
print(conf_matrix_xgb)
Mean Accuracy: 0.8636016096579476
Accuracy: 0.8714285714285714
Confusion Matrix:
[[49  4]
 [ 5 12]]
from sklearn.svm import SVC
# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []
# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform k-fold cross-validation over the training set
for train_index, test_index in kf.split(X_train):
    X_fold_train, X_fold_test = X_train.iloc[train_index], X_train.iloc[test_index]
    y_fold_train, y_fold_test = y_train.iloc[train_index], y_train.iloc[test_index]
    # Create the SVM classifier model
    svc = SVC()
    # Fit the model on the fold's training data
    svc.fit(X_fold_train, y_fold_train)
    # Make predictions on the held-out fold
    y_pred_svc = svc.predict(X_fold_test)
    # Evaluate the model's performance
    accuracy_svc = accuracy_score(y_fold_test, y_pred_svc)
    classification_rep_svc = classification_report(y_fold_test, y_pred_svc)
    conf_matrix = confusion_matrix(y_fold_test, y_pred_svc)
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_svc)
    classification_reports.append(classification_rep_svc)
    confusion_matrices.append(conf_matrix)
# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)
# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_svc)
print("Classification Report:")
print(classification_rep_svc)
print("Confusion Matrix:")
print(conf_matrix)
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Mean Accuracy: 0.7414285714285713
Accuracy: 0.7571428571428571
Classification Report:
precision recall f1-score support
0 0.76 1.00 0.86 53
1 0.00 0.00 0.00 17
accuracy 0.76 70
macro avg 0.38 0.50 0.43 70
weighted avg 0.57 0.76 0.65 70
Confusion Matrix:
[[53 0]
[17 0]]
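The all-zero predictions for the playoff class are typical of an RBF-kernel SVC fit on unscaled features, since columns like PSN sit on a much larger scale than the rate statistics. A hedged sketch on toy data (not the notebook's data frame) showing the usual remedy, scaling inside a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy imbalanced data; one column is blown up to mimic PSN's larger scale
X, y = make_classification(n_samples=300, n_features=12,
                           weights=[0.75, 0.25], random_state=42)
X[:, 0] *= 1000

unscaled = cross_val_score(SVC(), X, y, cv=5).mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean()
print(f"unscaled: {unscaled:.3f}  scaled: {scaled:.3f}")
```

Because the scaler is fit inside each fold, this comparison would also carry over cleanly to the notebook's 10-fold setup.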
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []
# Perform train-test split with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Perform k-fold cross-validation on the scaled training data
for train_index, test_index in kf.split(X_train_scaled):
    X_train_fold, X_test_fold = X_train_scaled[train_index], X_train_scaled[test_index]
    y_train_fold, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]
    # Create and train the k-nearest neighbors model on the fold
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_fold, y_train_fold)
    # Make predictions on the outer test set (note: each fold's model is
    # evaluated on the same held-out test split, not on its own fold)
    y_pred_knn = knn.predict(X_test_scaled)
    # Evaluate the model's performance
    accuracy_knn = accuracy_score(y_test, y_pred_knn)
    classification_rep_knn = classification_report(y_test, y_pred_knn)
    conf_matrix = confusion_matrix(y_test, y_pred_knn)
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_knn)
    classification_reports.append(classification_rep_knn)
    confusion_matrices.append(conf_matrix)
# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)
# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_knn)
print("Classification Report:")
print(classification_rep_knn)
print("Confusion Matrix:")
print(conf_matrix)
Mean Accuracy: 0.8835227272727273
Accuracy: 0.8806818181818182
Classification Report:
precision recall f1-score support
0 0.89 0.95 0.92 125
1 0.86 0.71 0.77 51
accuracy 0.88 176
macro avg 0.87 0.83 0.85 176
weighted avg 0.88 0.88 0.88 176
Confusion Matrix:
[[119 6]
[ 15 36]]
# Grid Search for best hyperparameter k
param_grid = {'n_neighbors': range(1, 10)}
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
print("Best k:", grid_search.best_params_['n_neighbors'])
print("Best Accuracy:", grid_search.best_score_)
Best k: 5
Best Accuracy: 0.8522289766970618
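Because `GridSearchCV` refits on the full training set by default (`refit=True`), the tuned model is available directly as `best_estimator_`. A minimal sketch on synthetic data (the real `X`/`y` are the team-stat features above, so the numbers here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled team-stat features
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Same search space as above: k from 1 to 9 with 5-fold CV
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 10)}, cv=5)
grid.fit(X_train_s, y_train)

# refit=True means best_estimator_ is already trained on all of X_train_s
best_knn = grid.best_estimator_
test_acc = best_knn.score(X_test_s, y_test)
print("Best k:", grid.best_params_["n_neighbors"], "Test accuracy:", test_acc)
```

Reporting `best_estimator_`'s held-out test accuracy alongside `best_score_` (a cross-validation average) guards against reading the CV score as a test-set result.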
models = {"LR": [accuracy_lr],
          "RFC": [accuracy_rfc],
          "XGB": [accuracy_xgb],
          "SVC": [accuracy_svc],
          "KNN": [accuracy_knn]}
# Create a DataFrame from the models dictionary
results_df = pd.DataFrame.from_dict(models, orient='index', columns=['Accuracy'])
results_df
| Model | Accuracy |
|---|---|
| LR | 0.857143 |
| RFC | 0.900000 |
| XGB | 0.871429 |
| SVC | 0.757143 |
| KNN | 0.880682 |
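Ranking the models makes the comparison explicit: Random Forest leads while SVM trails. A small sketch using the accuracy values from the table above (hard-coded here so the snippet stands alone):

```python
import pandas as pd

# Accuracy values copied from the comparison table above
models = {"LR": 0.857143, "RFC": 0.900000, "XGB": 0.871429,
          "SVC": 0.757143, "KNN": 0.880682}
results_df = pd.DataFrame.from_dict(models, orient="index", columns=["Accuracy"])

# Sort so the best-performing classifier appears first
ranked = results_df.sort_values("Accuracy", ascending=False)
print(ranked)
```

The ranking motivates the per-year comparison that follows: Logistic Regression and Random Forest are inspected season by season to see where each model's errors concentrate.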
import warnings
warnings.filterwarnings("ignore")

# Predict playoff qualifiers for each year (1990-2021), except the shortened 1994 and 2020 seasons.
for year in range(1990, 2022):
    if year != 1994 and year != 2020:
        _df = df[df["yearID"] == year].copy()
        X_year = _df[features]

        # Predict using the trained Logistic Regression model
        predicted_playoff_qualifiers_lr = lr.predict(X_year)

        # Add the predicted_playoff_qualifier column to the copied DataFrame
        _df["predicted_playoff_qualifier_lr"] = predicted_playoff_qualifiers_lr

        print("--------")
        print(year)
        print("--------")
        print("Actual playoff qualifiers in " + str(year) + ":")
        actual = set(_df[_df["make_playoffs_True"] == 1]["franchID"])
        print(actual)
        print("Predicted playoff qualifiers using Logistic Regression in " + str(year) + ":")
        predicted_lr = set(_df[_df["predicted_playoff_qualifier_lr"] == 1]["franchID"])
        print(predicted_lr)
        print()
        incorrect_lr = predicted_lr.difference(actual)
        print("Incorrect predictions using Logistic Regression (false positives) " + str(len(incorrect_lr)) + ":")
        print(incorrect_lr)
        exclusions_lr = actual.difference(predicted_lr)
        print("Incorrect exclusions from prediction using Logistic Regression (false negatives) " + str(len(exclusions_lr)) + ":")
        print(exclusions_lr)
        print()
        print()
--------
1990
--------
Actual playoff qualifiers in 1990:
{'PIT', 'BOS', 'OAK', 'CIN'}
Predicted playoff qualifiers using Logistic Regression in 1990:
{'PIT', 'WSN', 'SEA', 'TOR', 'OAK', 'NYM', 'CIN', 'LAD'}
Incorrect predictions using Logistic Regression (false positives) 5:
{'WSN', 'SEA', 'TOR', 'NYM', 'LAD'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 1:
{'BOS'}
--------
1991
--------
Actual playoff qualifiers in 1991:
{'PIT', 'TOR', 'MIN', 'ATL'}
Predicted playoff qualifiers using Logistic Regression in 1991:
{'PIT', 'MIN', 'ATL', 'TOR', 'CWS', 'NYM', 'LAD'}
Incorrect predictions using Logistic Regression (false positives) 3:
{'CWS', 'NYM', 'LAD'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 0:
set()
--------
1992
--------
Actual playoff qualifiers in 1992:
{'PIT', 'TOR', 'OAK', 'ATL'}
Predicted playoff qualifiers using Logistic Regression in 1992:
{'PIT', 'STL', 'WSN', 'MIN', 'ATL', 'TOR', 'OAK', 'BAL', 'CWS', 'MIL', 'CIN'}
Incorrect predictions using Logistic Regression (false positives) 7:
{'STL', 'WSN', 'MIN', 'BAL', 'CWS', 'MIL', 'CIN'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 0:
set()
--------
1993
--------
Actual playoff qualifiers in 1993:
{'TOR', 'CWS', 'PHI', 'ATL'}
Predicted playoff qualifiers using Logistic Regression in 1993:
{'WSN', 'ATL', 'TOR', 'HOU', 'PHI', 'SFG', 'CWS'}
Incorrect predictions using Logistic Regression (false positives) 3:
{'SFG', 'HOU', 'WSN'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 0:
set()
--------
1995
--------
Actual playoff qualifiers in 1995:
{'CLE', 'BOS', 'ATL', 'SEA', 'COL', 'CIN', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 1995:
{'CLE', 'CIN', 'ATL'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'BOS', 'SEA', 'COL', 'NYY', 'LAD'}
--------
1996
--------
Actual playoff qualifiers in 1996:
{'STL', 'CLE', 'ATL', 'BAL', 'SDP', 'NYY', 'LAD', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 1996:
{'CLE', 'SDP', 'ATL'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'STL', 'BAL', 'NYY', 'LAD', 'TEX'}
--------
1997
--------
Actual playoff qualifiers in 1997:
{'CLE', 'ATL', 'SEA', 'HOU', 'BAL', 'SFG', 'MIA', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 1997:
{'ATL', 'HOU', 'MIA', 'NYY', 'LAD'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'LAD'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'CLE', 'BAL', 'SFG', 'SEA'}
--------
1998
--------
Actual playoff qualifiers in 1998:
{'CLE', 'BOS', 'ATL', 'HOU', 'CHC', 'SDP', 'NYY', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 1998:
{'SDP', 'HOU', 'NYY', 'ATL'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'CLE', 'BOS', 'CHC', 'TEX'}
--------
1999
--------
Actual playoff qualifiers in 1999:
{'CLE', 'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'ARI', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 1999:
{'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'CIN', 'ARI'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'CIN'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 2:
{'CLE', 'TEX'}
--------
2000
--------
Actual playoff qualifiers in 2000:
{'STL', 'ATL', 'SEA', 'OAK', 'SFG', 'CWS', 'NYM', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 2000:
{'SFG', 'ATL'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'STL', 'SEA', 'OAK', 'CWS', 'NYM', 'NYY'}
--------
2001
--------
Actual playoff qualifiers in 2001:
{'STL', 'CLE', 'ATL', 'SEA', 'HOU', 'OAK', 'NYY', 'ARI'}
Predicted playoff qualifiers using Logistic Regression in 2001:
{'OAK', 'ARI', 'SEA', 'NYY'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'STL', 'CLE', 'HOU', 'ATL'}
--------
2002
--------
Actual playoff qualifiers in 2002:
{'STL', 'MIN', 'LAA', 'ATL', 'OAK', 'NYY', 'SFG', 'ARI'}
Predicted playoff qualifiers using Logistic Regression in 2002:
{'STL', 'BOS', 'LAA', 'ATL', 'SEA', 'OAK', 'NYY', 'SFG', 'ARI'}
Incorrect predictions using Logistic Regression (false positives) 2:
{'BOS', 'SEA'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 1:
{'MIN'}
--------
2003
--------
Actual playoff qualifiers in 2003:
{'BOS', 'MIN', 'ATL', 'OAK', 'SFG', 'CHC', 'MIA', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 2003:
{'SFG', 'OAK', 'NYY', 'SEA'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'SEA'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'BOS', 'MIN', 'ATL', 'CHC', 'MIA'}
--------
2004
--------
Actual playoff qualifiers in 2004:
{'STL', 'BOS', 'MIN', 'LAA', 'ATL', 'HOU', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2004:
{'STL', 'ATL'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'BOS', 'MIN', 'LAA', 'HOU', 'NYY', 'LAD'}
--------
2005
--------
Actual playoff qualifiers in 2005:
{'STL', 'BOS', 'LAA', 'ATL', 'HOU', 'CWS', 'SDP', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 2005:
{'STL', 'CLE', 'LAA', 'HOU', 'CWS'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'CLE'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'SDP', 'BOS', 'NYY', 'ATL'}
--------
2006
--------
Actual playoff qualifiers in 2006:
{'DET', 'STL', 'MIN', 'OAK', 'NYM', 'SDP', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2006:
{'NYM', 'NYY'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'DET', 'STL', 'MIN', 'OAK', 'SDP', 'LAD'}
--------
2007
--------
Actual playoff qualifiers in 2007:
{'CLE', 'BOS', 'LAA', 'COL', 'PHI', 'NYY', 'CHC', 'ARI'}
Predicted playoff qualifiers using Logistic Regression in 2007:
{'BOS', 'NYY', 'NYM'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'NYM'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'CLE', 'LAA', 'COL', 'PHI', 'CHC', 'ARI'}
--------
2008
--------
Actual playoff qualifiers in 2008:
{'BOS', 'LAA', 'PHI', 'TBR', 'CWS', 'CHC', 'MIL', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2008:
{'BOS', 'TOR', 'PHI', 'TBR', 'NYM', 'CHC', 'LAD'}
Incorrect predictions using Logistic Regression (false positives) 2:
{'TOR', 'NYM'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 3:
{'CWS', 'LAA', 'MIL'}
--------
2009
--------
Actual playoff qualifiers in 2009:
{'STL', 'BOS', 'MIN', 'LAA', 'COL', 'PHI', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2009:
{'BOS', 'NYY', 'LAD', 'ATL'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'ATL'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'STL', 'MIN', 'LAA', 'COL', 'PHI'}
--------
2010
--------
Actual playoff qualifiers in 2010:
{'MIN', 'ATL', 'PHI', 'TBR', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2010:
{'STL', 'ATL', 'PHI', 'TBR', 'SDP', 'NYY'}
Incorrect predictions using Logistic Regression (false positives) 2:
{'STL', 'SDP'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'SFG', 'CIN', 'MIN', 'TEX'}
--------
2011
--------
Actual playoff qualifiers in 2011:
{'DET', 'STL', 'PHI', 'NYY', 'TBR', 'MIL', 'ARI', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2011:
{'TBR', 'NYY', 'PHI', 'TEX'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'DET', 'STL', 'MIL', 'ARI'}
--------
2012
--------
Actual playoff qualifiers in 2012:
{'DET', 'STL', 'WSN', 'ATL', 'OAK', 'BAL', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2012:
{'TBR', 'WSN', 'NYY', 'ATL'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'TBR'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 7:
{'DET', 'STL', 'OAK', 'BAL', 'SFG', 'CIN', 'TEX'}
--------
2013
--------
Actual playoff qualifiers in 2013:
{'DET', 'CLE', 'PIT', 'BOS', 'STL', 'ATL', 'OAK', 'TBR', 'CIN', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2013:
{'BOS', 'ATL', 'OAK', 'CIN', 'TEX'}
Incorrect predictions using Logistic Regression (false positives) 1:
{'TEX'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'DET', 'CLE', 'PIT', 'STL', 'TBR', 'LAD'}
--------
2014
--------
Actual playoff qualifiers in 2014:
{'DET', 'PIT', 'STL', 'WSN', 'LAA', 'OAK', 'BAL', 'SFG', 'LAD', 'KCR'}
Predicted playoff qualifiers using Logistic Regression in 2014:
{'OAK', 'LAD', 'WSN'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 7:
{'DET', 'PIT', 'STL', 'LAA', 'BAL', 'SFG', 'KCR'}
--------
2015
--------
Actual playoff qualifiers in 2015:
{'PIT', 'STL', 'TOR', 'HOU', 'NYM', 'CHC', 'TEX', 'NYY', 'LAD', 'KCR'}
Predicted playoff qualifiers using Logistic Regression in 2015:
{'STL', 'TOR'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'PIT', 'HOU', 'KCR', 'NYM', 'CHC', 'NYY', 'LAD', 'TEX'}
--------
2016
--------
Actual playoff qualifiers in 2016:
{'CLE', 'BOS', 'WSN', 'TOR', 'BAL', 'SFG', 'NYM', 'CHC', 'LAD', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2016:
{'WSN', 'CHC'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'CLE', 'BOS', 'TOR', 'BAL', 'SFG', 'NYM', 'LAD', 'TEX'}
--------
2017
--------
Actual playoff qualifiers in 2017:
{'CLE', 'BOS', 'MIN', 'WSN', 'COL', 'HOU', 'NYY', 'CHC', 'ARI', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2017:
{'CLE', 'ARI', 'LAD', 'NYY'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'BOS', 'MIN', 'WSN', 'COL', 'HOU', 'CHC'}
--------
2018
--------
Actual playoff qualifiers in 2018:
{'CLE', 'BOS', 'ATL', 'COL', 'HOU', 'OAK', 'CHC', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2018:
{'CLE', 'HOU', 'BOS', 'LAD'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'ATL', 'COL', 'OAK', 'CHC', 'MIL', 'NYY'}
--------
2019
--------
Actual playoff qualifiers in 2019:
{'STL', 'WSN', 'MIN', 'ATL', 'HOU', 'OAK', 'TBR', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2019:
{'HOU', 'LAD'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'STL', 'WSN', 'MIN', 'ATL', 'OAK', 'TBR', 'MIL', 'NYY'}
--------
2021
--------
Actual playoff qualifiers in 2021:
{'STL', 'BOS', 'ATL', 'HOU', 'TBR', 'SFG', 'CWS', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2021:
{'SFG', 'LAD'}
Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'STL', 'BOS', 'ATL', 'HOU', 'TBR', 'CWS', 'MIL', 'NYY'}
import warnings
warnings.filterwarnings("ignore")

# Predict playoff qualifiers for each year (1990-2021), except the shortened 1994 and 2020 seasons.
for year in range(1990, 2022):
    if year != 1994 and year != 2020:
        _df = df[df["yearID"] == year].copy()
        X_year = _df[features]

        # Predict using the trained Random Forest model
        predicted_playoff_qualifiers_rfc = rfc.predict(X_year)

        # Add the predicted_playoff_qualifier column to the copied DataFrame
        _df["predicted_playoff_qualifier_rfc"] = predicted_playoff_qualifiers_rfc

        print("--------")
        print(year)
        print("--------")
        print("Actual playoff qualifiers in " + str(year) + ":")
        actual = set(_df[_df["make_playoffs_True"] == 1]["franchID"])
        print(actual)
        print("Predicted playoff qualifiers using Random Forest in " + str(year) + ":")
        predicted_rfc = set(_df[_df["predicted_playoff_qualifier_rfc"] == 1]["franchID"])
        print(predicted_rfc)
        print()
        incorrect_rfc = predicted_rfc.difference(actual)
        print("Incorrect predictions using Random Forest (false positives) " + str(len(incorrect_rfc)) + ":")
        print(incorrect_rfc)
        exclusions_rfc = actual.difference(predicted_rfc)
        print("Incorrect exclusions from prediction using Random Forest (false negatives) " + str(len(exclusions_rfc)) + ":")
        print(exclusions_rfc)
        print()
        print()
--------
1990
--------
Actual playoff qualifiers in 1990:
{'PIT', 'BOS', 'OAK', 'CIN'}
Predicted playoff qualifiers using Random Forest in 1990:
{'PIT', 'BOS', 'OAK', 'CIN'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
1991
--------
Actual playoff qualifiers in 1991:
{'PIT', 'TOR', 'MIN', 'ATL'}
Predicted playoff qualifiers using Random Forest in 1991:
{'PIT', 'TOR', 'MIN', 'ATL'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
1992
--------
Actual playoff qualifiers in 1992:
{'PIT', 'TOR', 'OAK', 'ATL'}
Predicted playoff qualifiers using Random Forest in 1992:
{'TOR', 'OAK', 'ATL'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'PIT'}
--------
1993
--------
Actual playoff qualifiers in 1993:
{'TOR', 'CWS', 'PHI', 'ATL'}
Predicted playoff qualifiers using Random Forest in 1993:
{'DET', 'ATL', 'TOR', 'PHI', 'SFG', 'CWS'}
Incorrect predictions using Random Forest (false positives) 2:
{'DET', 'SFG'}
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
1995
--------
Actual playoff qualifiers in 1995:
{'CLE', 'BOS', 'ATL', 'SEA', 'COL', 'CIN', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 1995:
{'CLE', 'BOS', 'ATL', 'SEA', 'COL', 'CIN', 'NYY', 'LAD'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
1996
--------
Actual playoff qualifiers in 1996:
{'STL', 'CLE', 'ATL', 'BAL', 'SDP', 'NYY', 'LAD', 'TEX'}
Predicted playoff qualifiers using Random Forest in 1996:
{'STL', 'CLE', 'ATL', 'BAL', 'SDP', 'NYY', 'LAD', 'TEX'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
1997
--------
Actual playoff qualifiers in 1997:
{'CLE', 'ATL', 'SEA', 'HOU', 'BAL', 'SFG', 'MIA', 'NYY'}
Predicted playoff qualifiers using Random Forest in 1997:
{'CLE', 'ATL', 'SEA', 'HOU', 'BAL', 'SFG', 'MIA', 'NYY'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
1998
--------
Actual playoff qualifiers in 1998:
{'CLE', 'BOS', 'ATL', 'HOU', 'CHC', 'SDP', 'NYY', 'TEX'}
Predicted playoff qualifiers using Random Forest in 1998:
{'CLE', 'BOS', 'ATL', 'HOU', 'CHC', 'SDP', 'NYY', 'TEX'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
1999
--------
Actual playoff qualifiers in 1999:
{'CLE', 'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'ARI', 'TEX'}
Predicted playoff qualifiers using Random Forest in 1999:
{'CLE', 'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'ARI', 'TEX'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2000
--------
Actual playoff qualifiers in 2000:
{'STL', 'ATL', 'SEA', 'OAK', 'SFG', 'CWS', 'NYM', 'NYY'}
Predicted playoff qualifiers using Random Forest in 2000:
{'STL', 'ATL', 'SEA', 'OAK', 'SFG', 'CWS', 'NYM', 'NYY'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2001
--------
Actual playoff qualifiers in 2001:
{'STL', 'CLE', 'ATL', 'SEA', 'HOU', 'OAK', 'NYY', 'ARI'}
Predicted playoff qualifiers using Random Forest in 2001:
{'STL', 'CLE', 'ATL', 'SEA', 'HOU', 'OAK', 'NYY', 'ARI'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2002
--------
Actual playoff qualifiers in 2002:
{'STL', 'MIN', 'LAA', 'ATL', 'OAK', 'NYY', 'SFG', 'ARI'}
Predicted playoff qualifiers using Random Forest in 2002:
{'STL', 'MIN', 'LAA', 'ATL', 'OAK', 'NYY', 'SFG', 'ARI'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2003
--------
Actual playoff qualifiers in 2003:
{'BOS', 'MIN', 'ATL', 'OAK', 'SFG', 'CHC', 'MIA', 'NYY'}
Predicted playoff qualifiers using Random Forest in 2003:
{'STL', 'BOS', 'MIN', 'ATL', 'OAK', 'SFG', 'CHC', 'MIA', 'NYY'}
Incorrect predictions using Random Forest (false positives) 1:
{'STL'}
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2004
--------
Actual playoff qualifiers in 2004:
{'STL', 'BOS', 'MIN', 'LAA', 'ATL', 'HOU', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2004:
{'STL', 'BOS', 'MIN', 'LAA', 'ATL', 'HOU', 'NYY', 'LAD'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2005
--------
Actual playoff qualifiers in 2005:
{'STL', 'BOS', 'LAA', 'ATL', 'HOU', 'CWS', 'SDP', 'NYY'}
Predicted playoff qualifiers using Random Forest in 2005:
{'STL', 'BOS', 'ATL', 'HOU', 'CWS', 'SDP', 'NYY'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'LAA'}
--------
2006
--------
Actual playoff qualifiers in 2006:
{'DET', 'STL', 'MIN', 'OAK', 'NYM', 'SDP', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2006:
{'DET', 'STL', 'MIN', 'OAK', 'NYM', 'SDP', 'NYY', 'LAD'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2007
--------
Actual playoff qualifiers in 2007:
{'CLE', 'BOS', 'LAA', 'COL', 'PHI', 'NYY', 'CHC', 'ARI'}
Predicted playoff qualifiers using Random Forest in 2007:
{'CLE', 'BOS', 'LAA', 'COL', 'PHI', 'NYY', 'CHC', 'ARI'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2008
--------
Actual playoff qualifiers in 2008:
{'BOS', 'LAA', 'PHI', 'TBR', 'CWS', 'CHC', 'MIL', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2008:
{'BOS', 'LAA', 'PHI', 'TBR', 'CWS', 'NYM', 'CHC', 'MIL'}
Incorrect predictions using Random Forest (false positives) 1:
{'NYM'}
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'LAD'}
--------
2009
--------
Actual playoff qualifiers in 2009:
{'STL', 'BOS', 'MIN', 'LAA', 'COL', 'PHI', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2009:
{'STL', 'BOS', 'MIN', 'LAA', 'COL', 'PHI', 'NYY', 'LAD'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2010
--------
Actual playoff qualifiers in 2010:
{'MIN', 'ATL', 'PHI', 'TBR', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2010:
{'MIN', 'ATL', 'PHI', 'TBR', 'SFG', 'CIN', 'NYY', 'TEX'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2011
--------
Actual playoff qualifiers in 2011:
{'DET', 'STL', 'PHI', 'NYY', 'TBR', 'MIL', 'ARI', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2011:
{'DET', 'STL', 'PHI', 'NYY', 'TBR', 'MIL', 'ARI', 'TEX'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2012
--------
Actual playoff qualifiers in 2012:
{'DET', 'STL', 'WSN', 'ATL', 'OAK', 'BAL', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2012:
{'DET', 'STL', 'WSN', 'ATL', 'OAK', 'BAL', 'SFG', 'CIN', 'NYY', 'TEX'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2013
--------
Actual playoff qualifiers in 2013:
{'DET', 'CLE', 'PIT', 'BOS', 'STL', 'ATL', 'OAK', 'TBR', 'CIN', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2013:
{'DET', 'CLE', 'PIT', 'BOS', 'STL', 'ATL', 'OAK', 'TBR', 'CIN', 'LAD'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2014
--------
Actual playoff qualifiers in 2014:
{'DET', 'PIT', 'STL', 'WSN', 'LAA', 'OAK', 'BAL', 'SFG', 'LAD', 'KCR'}
Predicted playoff qualifiers using Random Forest in 2014:
{'DET', 'PIT', 'STL', 'WSN', 'LAA', 'OAK', 'BAL', 'SFG', 'LAD', 'KCR'}
Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()
--------
2015
--------
Actual playoff qualifiers in 2015:
{'PIT', 'STL', 'TOR', 'HOU', 'NYM', 'CHC', 'TEX', 'NYY', 'LAD', 'KCR'}
Predicted playoff qualifiers using Random Forest in 2015:
{'PIT', 'STL', 'WSN', 'TOR', 'HOU', 'SFG', 'NYM', 'CHC', 'LAD'}
Incorrect predictions using Random Forest (false positives) 2:
{'SFG', 'WSN'}
Incorrect exclusions from prediction using Random Forest (false negatives) 3:
{'KCR', 'NYY', 'TEX'}
--------
2016
--------
Actual playoff qualifiers in 2016:
{'CLE', 'BOS', 'WSN', 'TOR', 'BAL', 'SFG', 'NYM', 'CHC', 'LAD', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2016:
{'CLE', 'BOS', 'WSN', 'SEA', 'TOR', 'SFG', 'CHC', 'LAD'}
Incorrect predictions using Random Forest (false positives) 1:
{'SEA'}
Incorrect exclusions from prediction using Random Forest (false negatives) 3:
{'NYM', 'TEX', 'BAL'}
--------
2017
--------
Actual playoff qualifiers in 2017:
{'CLE', 'BOS', 'MIN', 'WSN', 'COL', 'HOU', 'NYY', 'CHC', 'ARI', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2017:
{'STL', 'CLE', 'BOS', 'WSN', 'HOU', 'NYY', 'CHC', 'ARI', 'LAD'}
Incorrect predictions using Random Forest (false positives) 1:
{'STL'}
Incorrect exclusions from prediction using Random Forest (false negatives) 2:
{'COL', 'MIN'}
--------
2018
--------
Actual playoff qualifiers in 2018:
{'CLE', 'BOS', 'ATL', 'COL', 'HOU', 'OAK', 'CHC', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2018:
{'CLE', 'BOS', 'WSN', 'ATL', 'HOU', 'OAK', 'TBR', 'CHC', 'MIL', 'NYY', 'LAD'}
Incorrect predictions using Random Forest (false positives) 2:
{'TBR', 'WSN'}
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'COL'}
--------
2019
--------
Actual playoff qualifiers in 2019:
{'STL', 'WSN', 'MIN', 'ATL', 'HOU', 'OAK', 'TBR', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2019:
{'STL', 'CLE', 'WSN', 'MIN', 'ATL', 'HOU', 'OAK', 'TBR', 'NYY', 'LAD'}
Incorrect predictions using Random Forest (false positives) 1:
{'CLE'}
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'MIL'}
--------
2021
--------
Actual playoff qualifiers in 2021:
{'STL', 'BOS', 'ATL', 'HOU', 'TBR', 'SFG', 'CWS', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2021:
{'ATL', 'TOR', 'HOU', 'OAK', 'TBR', 'SFG', 'CWS', 'MIL', 'NYY', 'LAD'}
Incorrect predictions using Random Forest (false positives) 2:
{'TOR', 'OAK'}
Incorrect exclusions from prediction using Random Forest (false negatives) 2:
{'STL', 'BOS'}
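The long per-year listings above can be condensed into season-level error totals for a side-by-side comparison of the two models. A sketch of such a helper; the function name `season_error_totals` and the toy DataFrame are illustrative only, and real use would pass `lr.predict` or `rfc.predict` together with the `df` and `features` defined earlier:

```python
import numpy as np
import pandas as pd

def season_error_totals(predict_fn, df, features, years):
    """Total false positives/negatives across seasons for a playoff classifier."""
    fp_total, fn_total = 0, 0
    for year in years:
        season = df[df["yearID"] == year]
        preds = predict_fn(season[features])
        actual = set(season.loc[season["make_playoffs_True"] == 1, "franchID"])
        predicted = set(season.loc[np.asarray(preds) == 1, "franchID"])
        fp_total += len(predicted - actual)   # predicted but missed the playoffs
        fn_total += len(actual - predicted)   # made the playoffs but not predicted
    return fp_total, fn_total

# Tiny synthetic check: a rule that "predicts" playoffs for 90+ win teams
toy = pd.DataFrame({
    "yearID": [1990, 1990, 1991, 1991],
    "franchID": ["NYY", "BOS", "NYY", "BOS"],
    "W": [95, 70, 60, 98],
    "make_playoffs_True": [1, 0, 0, 1],
})
predict_wins = lambda X: (X["W"] >= 90).astype(int).to_numpy()
print(season_error_totals(predict_wins, toy, ["W"], [1990, 1991]))
```

Comparing these totals across 1990-2021 (excluding 1994 and 2020) would quantify what the listings suggest qualitatively: Random Forest produces far fewer false negatives than Logistic Regression, particularly after the wild-card era expanded the playoff field.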